Abstract

We initiate the study of dimensionality reduction in general metric spaces in the context
of supervised learning. Our statistical contribution consists of tight Rademacher bounds for 
Lipschitz functions in metric spaces that are doubling, or nearly doubling. As a by-product, 
we obtain a new theoretical explanation for the empirically reported improvements gained by 
pre-processing Euclidean data by PCA (Principal Components Analysis) prior to constructing 
a linear classifier. On the algorithmic front, we describe an analogue of PCA for metric spaces, 
namely an efficient procedure that approximates the data's intrinsic dimension, which is often 
much lower than the ambient dimension. Thus, our approach can exploit the dual benefits 
of low dimensionality: (1) more efficient proximity search algorithms, and (2) more optimistic 
generalization bounds.

1 Introduction

Linear classifiers play a central role in supervised learning, with a rich and elegant theory. This 
setting assumes data is represented as points in a Hilbert space, either explicitly as feature vectors 
or implicitly via a kernel. The main advantage of the Hilbert-space model is its inner-product 
structure, which has been exploited statistically and algorithmically by sophisticated techniques 
from geometric and functional analysis to place the celebrated hyperplane methods on a solid 
foundation. However, the success of the Hilbert-space model obscures its limitations, perhaps
the most significant of which is that it cannot represent many norms and distance functions that
arise naturally in applications. Formally, metrics such as $L_1$, earthmover, and edit distance cannot
be embedded into a Hilbert space without distorting distances by a large factor [Enflo, 1969,
Naor and Schechtman, 2007, Andoni and Krauthgamer, 2010]. Indeed, recent years have seen a
growing interest and success in extending the theory of linear classifiers to Banach spaces and even
to general metric spaces, see e.g. [Micchelli and Pontil, 2004, von Luxburg and Bousquet, 2004,
Hein et al., 2005, Der and Lee, 2007, Zhang et al., 2009].



Another key concept in learning is dimensionality of the data. The dimension is known to
control the learner's efficiency, both statistically, i.e. sample complexity, and algorithmically, i.e.
computational runtime. This dependence on dimension is true not only for Hilbertian spaces, but
also for general metric spaces, where the sample complexity and/or algorithmic runtime can be
bounded in terms of the covering number or the doubling dimension [von Luxburg and Bousquet,
2004, Gottlieb et al., 2010].



Our contribution. We examine data that is close to being low-dimensional, both in the scenario
of Hilbertian spaces and in the case of general metric spaces. To illustrate the former setting,
consider $(\mathbb{R}^N, \|\cdot\|_2)$. Let the observed sample be $(x_1,y_1),\ldots,(x_n,y_n) \in \mathbb{R}^N \times \{-1,1\}$, and suppose
that $\{x_1,\ldots,x_n\}$ is close to a low-dimensional linear subspace $T \subset \mathbb{R}^N$, in the sense that its
distortion $\eta = \frac{1}{n}\sum_i \|x_i - P_T(x_i)\|_2^2$ is small, where $P_T : \mathbb{R}^N \to T$ denotes orthogonal projection
onto $T$. We prove in Section 3 that when both $\dim(T)$ and the distortion $\eta$ are small, a linear classifier
generalizes well regardless of the ambient dimension $N$ or the separation margin. Implicit in our
result is an optimal tradeoff between the reduced dimension and the distortion, which can be
optimized efficiently via PCA (Principal Components Analysis). In fact, our analysis provides, to
the best of our knowledge, the first rigorous theory for selecting a cutoff value for the singular
values, in any supervised learning setting. Algorithmically, our approach amounts to running
PCA with a cutoff value implied by Corollary 3.2, constructing a linear classifier on the projected
data $(P_T(x_1),y_1),\ldots,(P_T(x_n),y_n)$, and "lifting" this linear classifier to $\mathbb{R}^N$, exploiting the lower
dimensionality to speed up the classifier's construction.

We develop this approach significantly further beyond the Euclidean case, extending it to general
metric spaces. Let the observed sample be $(x_1,y_1),\ldots,(x_n,y_n) \in \mathcal{X} \times \{-1,1\}$, where $(\mathcal{X},\rho)$ is
some metric space, and consider the statistical framework proposed by von Luxburg and Bousquet
[2004] of realizing classifiers by Lipschitz functions. Subsequently, Gottlieb et al. [2010] obtained
generalization bounds and designed fast algorithms for metric spaces whose doubling dimension,
denoted $\mathrm{ddim}(\mathcal{X})$, is low (see Section 2 for definitions). The present work makes a considerably
less restrictive assumption: namely, that the sample points lie close to some low-dimensional set.

We present classification algorithms that adapt to the intrinsic dimensionality of the data and
exploit it for improved accuracy and runtime complexity. First we establish in Section 4 new
generalization bounds for the scenario where there is a set $\tilde{S} = \{\tilde{x}_i\}$ of low doubling dimension,
whose distortion $\eta = \frac{1}{n}\sum_i \rho(x_i,\tilde{x}_i)$ is small. In this case, the Lipschitz extension classifier will
generalize well, regardless of the ambient dimension $\mathrm{ddim}(\mathcal{X})$; see Theorem 4.4. Next, we design
in Section 5 a polynomial-time algorithm that finds a near-optimal point set $\tilde{S}$. Formally, our
algorithm achieves a bicriteria approximation, meaning that $\mathrm{ddim}(\tilde{S})$ and $\eta$ of the reported solution
exceed the values of an optimal solution by at most a constant factor; see Theorem 5.1. The
overall classification algorithm operates by computing $\tilde{S}$ and constructing a Lipschitz classifier on
the modified training set $(\tilde{x}_1,y_1),\ldots,(\tilde{x}_n,y_n)$, exploiting the fact that it is low-dimensional, say, using
[Gottlieb et al., 2010]. An important feature of our method is that the generalization bounds depend
only on the intrinsic dimension of the data, and not on that of the ambient space.



Related Work. There is a plethora of literature on dimensionality reduction, see e.g. [Lee and Verleysen,
2007] and [Burges, 2010], and thus we restrict the ensuing discussion to results addressing supervised
learning. Previously, only Euclidean dimension reduction was considered, and chiefly for the purpose
of improving runtime efficiency. This was realized by projecting the data onto a random
low-dimensional subspace, a data-oblivious technique, see e.g. [Balcan et al., 2006, Rahimi and Recht,
2007, Paul et al., 2012]. On the other hand, data-dependent dimensionality reduction techniques
have been observed empirically to improve classification performance. For instance, PCA may
be applied as a pre-processing filter to denoise the data, and other algorithms combine dimension
reduction with learning algorithms such as SVM, see e.g. [Bi et al., 2003, Fukumizu et al., 2004,
Huang and Aviyente, 2007, Varshney and Willsky, 2011]. Remarkably, these techniques in some
sense defy standard margin theory as they are liable to decrease the separation margin. Our
analysis in Section 3 sheds new light on the matter.

There is little previous work on dimension reduction in general metric spaces. MDS
(Multi-Dimensional Scaling) is a generalization of PCA, whose input is metric (pairwise distances);
however, its output is Euclidean and thus effective only for metrics that are "nearly" Euclidean.
Gottlieb and Krauthgamer considered another metric problem, of removing from an input
set S a small fraction of points, so as to obtain a large subset of low doubling dimension. While
close in spirit, their objective is quite different from ours, and also seems to require rather different
techniques.



2 Definitions and notation 



We use standard notation and definitions throughout, and assume a familiarity with the basic
notions of Euclidean and normed spaces. We write $\mathbb{1}_{\{\cdot\}}$ for the indicator function of the relevant
predicate and $\mathrm{sgn}(x) := 2 \cdot \mathbb{1}_{\{x \ge 0\}} - 1$.

Metric spaces. A metric $\rho$ on a set $\mathcal{X}$ is a positive symmetric function satisfying the triangle
inequality $\rho(x,y) \le \rho(x,z) + \rho(z,y)$; together the two comprise the metric space $(\mathcal{X},\rho)$. The
diameter of a set $A \subseteq \mathcal{X}$ is defined by $\mathrm{diam}(A) := \sup_{x,y \in A} \rho(x,y)$. The Lipschitz constant of
a function $f : \mathcal{X} \to \mathbb{R}$, denoted by $\|f\|_{\mathrm{Lip}}$, is defined to be the infimum $L \ge 0$ that satisfies
$|f(x) - f(y)| \le L \cdot \rho(x,y)$ for all $x,y \in \mathcal{X}$.



Doubling dimension. For a metric $(\mathcal{X},\rho)$, let $\lambda_{\mathcal{X}} > 0$ be the smallest value such that every
ball in $\mathcal{X}$ can be covered by $\lambda_{\mathcal{X}}$ balls of half the radius. $\lambda_{\mathcal{X}}$ is the doubling constant of $\mathcal{X}$, and
the doubling dimension of $\mathcal{X}$ is defined as $\mathrm{ddim}(\mathcal{X}) := \log_2(\lambda_{\mathcal{X}})$. It is well-known that a
$d$-dimensional Euclidean space, or any subset of it, has doubling dimension $O(d)$; however, low
doubling dimension is strictly more general than low Euclidean dimension, see e.g. [Gupta et al.,
2003].



Point hierarchies. Let $S$ be a point set of diameter 1 and minimum interpoint distance $\delta > 0$.
A hierarchy $\mathcal{S}$ of a set $S$ is a sequence of nested sets $S_0 \subset \ldots \subset S_t$; here, $t = \lceil \log_2(1/\delta) \rceil$ and
$S_t = S$, while $S_0$ consists of a single point. Set $S_i$ must possess a packing property, which asserts
that $d(v,w) \ge 2^{-i}$ for all distinct $v,w \in S_i$, and a $c$-covering property for $c \ge 1$ (with respect to $S_{i+1}$),
which asserts that for each $v \in S_{i+1}$ there exists $w \in S_i$ with $d(v,w) \le c \cdot 2^{-i}$. Set $S_i$ is called a
$2^{-i}$-net of the hierarchy.
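
The definition above can be made concrete with a short sketch. It builds a hierarchy with the 1-covering property by greedily refining nets from coarse to fine. The function name `build_hierarchy`, the tuple representation of points, and the default Euclidean distance are assumptions made for illustration only; the sketch further assumes at least two distinct points, diameter at most 1, and minimum interpoint distance below 1.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def build_hierarchy(
    points: Sequence[Point],
    dist: Callable[[Point, Point], float] = math.dist,
) -> List[List[Point]]:
    """Greedily build nested nets S_0, S_1, ..., S_t with S_t = S.

    Hypothetical helper: assumes distinct points and diameter at most 1.
    """
    pts = list(points)
    # delta is the minimum interpoint distance; it fixes the number of levels.
    delta = min(dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:])
    t = math.ceil(math.log2(1.0 / delta))
    levels = [[pts[0]]]  # S_0 is a single (arbitrary) point
    for i in range(1, t + 1):
        radius = 2.0 ** (-i)
        net = list(levels[-1])  # nestedness: every point of S_{i-1} is kept
        for p in pts:
            # Add p only if it is far from every point already in the net;
            # this preserves the packing property d(v, w) >= 2^{-i}.
            if all(dist(p, q) >= radius for q in net):
                net.append(p)
        levels.append(net)
    return levels
```

Because each level is a maximal packing at scale $2^{-i}$, every point of $S$ lies within $2^{-i}$ of $S_i$, which gives the covering property with $c = 1$. The quadratic scan is chosen for readability rather than speed.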

Every point set $S$ possesses a hierarchy, which need not be unique, for any value of $c \ge 1$. We
will need the following lemma.

Lemma 2.1. Let $S$ be a point set, and let $\mathcal{S}$ be a hierarchy for $S$ with a $c$-covering property.
For every subset $T \subset S$ with doubling dimension $d := \mathrm{ddim}(T)$, there exists a set $S'$ satisfying
$T \subset S' \subset S$, and an associated hierarchy $\mathcal{S}'$ with the following properties:

1. $\mathrm{ddim}(S') \le \log(2^{3d} + 1) = 3d + o(1)$.

2. Every point $v \in S'_i$ is within distance $4c \cdot 2^{-p}$ of some point $w \in S'_p$, for $p < i$.

3. $\mathcal{S}'$ is a sub-hierarchy of $\mathcal{S}$, meaning that $S'_i \subset S_i$ for all $S'_i \in \mathcal{S}'$.

Proof. Take set $T$ and extract from it an arbitrary hierarchy $\mathcal{T}$ composed of sets $T_i$. Note that
each point $v \in T_i$ ($i \ge 1$) is necessarily within distance $2c \cdot 2^{-i}$ of some point in $S_i$; this is because $v$
exists in $S_t$, and by the $c$-covering property of $\mathcal{S}$, $v$ must be within distance $\sum_{j=i}^{t-1} c \cdot 2^{-j} < 2c \cdot 2^{-i}$
of some point $w \in S_i$.

We initialize the hierarchy $\mathcal{S}'$ by setting $S'_0 = T_0$. Construct $S'_i$ for $i > 0$ by first including
the points of $S'_{i-1}$. Then, for each $v \in T_i$, if $v$ is not within distance $2c \cdot 2^{-i}$ of a point already in
$S'_i$, add the closest point $w \in S_i$ to $v$ to $S'_i$. This implies that every point in $S'_i$ is within distance
$2c \cdot 2^{-p} + 2c \cdot 2^{-p} = 4c \cdot 2^{-p}$ of some point in $S'_p$, $p < i$. The hierarchy $\mathcal{S}'$ also inherits the packing
property of hierarchy $\mathcal{S}$.

Turning to the dimension, 'moving' points in $T_i$ a distance of less than $2c \cdot 2^{-i}$ can increase the
dimension by a multiplicative factor of 3. Retaining the points of the previous level in the hierarchy
can serve to add 1 to the doubling constant. □



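Covering numbers. The $\varepsilon$-covering number of a metric space $(\mathcal{X},\rho)$, denoted $\mathcal{N}(\varepsilon,\mathcal{X},\rho)$, is
defined as the smallest number of balls of radius $\varepsilon$ that suffices to cover $\mathcal{X}$. The covering numbers may
be estimated as follows by repeatedly invoking the doubling property, see e.g. [Krauthgamer and Lee,
2004].

Lemma 2.2. If $(\mathcal{X},\rho)$ is a metric space with $\mathrm{ddim}(\mathcal{X}) < \infty$ and $\mathrm{diam}(\mathcal{X}) < \infty$, then

$$\mathcal{N}(\varepsilon,\mathcal{X},\rho) \le \left(\frac{2\,\mathrm{diam}(\mathcal{X})}{\varepsilon}\right)^{\mathrm{ddim}(\mathcal{X})}.$$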

Learning. Our setting in this paper is the agnostic PAC learning model, see e.g. [Mohri et al.,
2012], where examples are drawn independently from $\mathcal{X} \times \{-1,1\}$ according to some unknown
probability distribution $P$. The learner, having observed $n$ such pairs $(x,y)$, produces a hypothesis
$h : \mathcal{X} \to \{-1,1\}$. The generalization error $P(h(x) \ne y)$ is the probability of misclassifying a new
point. Most generalization bounds consist of a sample error term (approximately corresponding to
bias in statistics), which is the fraction of observed examples misclassified by $h$, and a hypothesis
complexity term (a rough analogue of variance in statistics), which measures the richness of the
class of all admissible hypotheses [Wasserman, 2006].¹ A data-driven procedure for selecting the
correct hypothesis complexity is known as model selection and is typically performed by some
variant of Structural Risk Minimization [Shawe-Taylor et al., 1998], an analogue of the
bias-variance tradeoff in statistics. Keeping in line with the literature, we ignore the measure-theoretic
technicalities associated with taking suprema over uncountable function classes.



Rademacher complexity. For any $n$ points $Z_1,\ldots,Z_n$ in $\mathcal{Z}$ and any collection of functions $G$
mapping $\mathcal{Z}$ to a bounded range, we may define the Rademacher complexity of $G$ evaluated at the
$n$ points:

$$R_n(G;\{Z_i\}) = \mathbb{E} \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z_i),$$

where the expectation is over the iid random variables $\sigma_i$ that take on $\pm 1$ with probability 1/2. The
seminal work of Bartlett and Mendelson [2002] and Koltchinskii and Panchenko [2002] established
the central role of Rademacher complexities in generalization bounds. We quote Theorems 2.3 and
2.4 from Mohri et al. [2012].



¹ The additional confidence term, typically $O(\sqrt{\log(1/\delta)/n})$, is standard and usually not optimized over.

Theorem 2.3. Let $G$ be a collection of functions mapping $\mathcal{Z}$ to $[0,1]$ and let $Z_i \in \mathcal{Z}$, $i = 1,\ldots,n$
be an iid sample. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all
$g \in G$:

$$\mathbb{E}[g(Z)] \le \frac{1}{n} \sum_{i=1}^n g(Z_i) + 2R_n(G;\{Z_i\}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$



Theorem 2.4. Let $H$ be a collection of functions mapping $\mathcal{X}$ to $[-1,1]$, and let $(X_i,Y_i) \in \mathcal{X} \times
\{-1,1\}$, $i = 1,\ldots,n$ be an iid sample. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the
following holds for all $h \in H$ and all $\gamma \in (0,1)$:

$$P(\mathrm{sgn}(h(X)) \ne Y) \le \frac{1}{n} \sum_{i=1}^n L_\gamma(h(X_i),Y_i) + \frac{4}{\gamma} R_n(H;\{X_i\}) + \sqrt{\frac{\log\log_2(2/\gamma)}{n}} + 3\sqrt{\frac{\log(4/\delta)}{2n}},$$

where $L_\gamma(y,y') = \min(\max(0, 1 - yy'/\gamma), 1)$ is the margin loss.



The Rademacher complexity of a binary function class $F$ may be controlled by the VC-dimension
$d$ of $F$ through an application of Massart's and Sauer's lemmas:

$$R_n(F;\{Z_i\}) \le \sqrt{\frac{2d\log(en/d)}{n}}. \qquad (1)$$

Considerably more delicate bounds may be obtained by estimating the covering numbers and
using Dudley's chaining integral [Dudley, 1987]:

$$R_n(F) \le \inf_{\alpha \ge 0} \left( 4\alpha + 12 \int_\alpha^\infty \sqrt{\frac{\log \mathcal{N}(t,F,\|\cdot\|_2)}{n}}\, dt \right). \qquad (2)$$



3 Adaptive Dimensionality Reduction: the Euclidean case 

Consider the problem of supervised classification in $\mathbb{R}^N$ by linear hyperplanes, where $N \gg 1$.
The training sample is $(X_i,Y_i)$, $i = 1,\ldots,n$, with $(X_i,Y_i) \in \mathbb{R}^N \times \{-1,1\}$, and without loss of
generality we take $\|X_i\|_2 \le 1$ and the hypothesis class $H = \{x \mapsto \mathrm{sgn}(w \cdot x) : \|w\|_2 \le 1\}$. Absent
additional assumptions on the data, this is a high-dimensional learning problem with a costly
sample complexity. Indeed, the VC-dimension of linear hyperplanes in $N$ dimensions is $N$. If,
however, it turns out that the data actually lies on a $k$-dimensional subspace of $\mathbb{R}^N$, Eq. (1)
implies that $R_n(H) \le \sqrt{2k\log(en/k)/n}$, and hence a much better generalization for $k \ll N$. A
more common benign distributional assumption is that of large-margin separability. In fact, the
main insight articulated in [Blum, 2005] is that data separable by margin $\gamma$ effectively lies in an
$O(1/\gamma^2)$-dimensional space.

In this section, we consider the case where the data lies "close" to a low-dimensional subspace.
Formally, we say that the data $\{X_i\}$ is $\eta$-close to a subspace $T \subset \mathbb{R}^N$ if

$$\frac{1}{n} \sum_{i=1}^n \|P_T(X_i) - X_i\|_2^2 \le \eta \qquad (3)$$

(where $P_T(\cdot)$ denotes the orthogonal projection onto the subspace $T$). In cases where (3) holds,
the Rademacher complexity can be bounded in terms of $\dim(T)$ and $\eta$ alone (Theorem 3.1). As a
consequence, we obtain a bound on the expected hinge-loss² (Corollary 3.2). This both motivates
and guides the use of PCA for classification.

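Theorem 3.1. Let $X_1,\ldots,X_n$ lie in $\mathbb{R}^N$ with $\|X_i\|_2 \le 1$ and define the function class $F =
\{x \mapsto w \cdot x : \|w\|_2 \le 1\}$. Suppose that (3) holds for some subspace $T \subset \mathbb{R}^N$ and $\eta > 0$. Then

$$R_n(F;\{X_i\}) \le 17\sqrt{\frac{\dim(T)}{n}} + \sqrt{\frac{\eta}{n}}.$$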

Remark. Notice that the Rademacher complexity is independent of the ambient dimension $N$,
and instead the distortion $\eta$ plays the role of effective dimension or inverse margin. Also note the
tension between $\dim(T)$ and $\eta$ in the bound above: as we seek a lower-dimensional approximation,
we are liable to incur a larger distortion.
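
As a numerical illustration of Theorem 3.1, the following sketch estimates the Rademacher complexity of the unit-norm linear class by Monte Carlo and evaluates the stated bound for a given subspace. The function names, the NumPy dependency, and the choice of 2000 sign draws are assumptions for illustration; the estimator uses the closed form $\sup_{\|w\|_2 \le 1} w \cdot v = \|v\|_2$.

```python
import numpy as np


def empirical_rademacher_linear(X: np.ndarray, n_draws: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of R_n(F; {X_i}) for F = {x -> w.x : ||w||_2 <= 1}.

    For this class, sup_w (1/n) sum_i sigma_i w.X_i = (1/n) ||sum_i sigma_i X_i||_2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher signs
    return float(np.linalg.norm(sigma @ X, axis=1).mean() / n)


def theorem_3_1_bound(X: np.ndarray, basis: np.ndarray) -> float:
    """Evaluate 17*sqrt(dim(T)/n) + sqrt(eta/n), with T spanned by the
    orthonormal columns of `basis` (a hypothetical input format)."""
    n = X.shape[0]
    residual = X - (X @ basis) @ basis.T    # X_i - P_T(X_i), row by row
    eta = float(np.sum(residual ** 2) / n)  # smallest eta for which (3) holds
    k = basis.shape[1]
    return float(17.0 * np.sqrt(k / n) + np.sqrt(eta / n))
```

On data with $\|X_i\|_2 \le 1$ the estimate is typically far below the bound, since the constant 17 comes from a worst-case chaining argument.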

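Proof. Denote by $S^{\parallel} = (X_1^{\parallel},\ldots,X_n^{\parallel})$ and $S^{\perp} = (X_1^{\perp},\ldots,X_n^{\perp})$ the parallel and perpendicular
components of the points $\{X_i\}$ with respect to $T$. Note that each $X_i$ has the unique decomposition
$X_i = X_i^{\parallel} + X_i^{\perp}$. We first decompose the Rademacher complexity into "parallel" and "perpendicular"
terms:

$$R_n(F;\{X_i\}) = \frac{1}{n}\,\mathbb{E}_\sigma\left[\sup_{\|w\|_2 \le 1} \sum_{i=1}^n \sigma_i (w \cdot X_i)\right]
= \frac{1}{n}\,\mathbb{E}_\sigma\left[\sup_{\|w\|_2 \le 1} w \cdot \sum_{i=1}^n \sigma_i \left(X_i^{\parallel} + X_i^{\perp}\right)\right]
\le R_n(F;S^{\parallel}) + R_n(F;S^{\perp}). \qquad (4)$$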



We then proceed to bound the two terms in (4). To bound the first term, note that $F$ restricted
to $T$ is a function class with linear-algebraic dimension $\dim(T)$, and furthermore our assumption
that the data lies in the unit ball implies that the range of $F$ is bounded by 1 in absolute value.
Hence, the classic covering number estimate (see [Mendelson and Vershynin, 2003])

$$\mathcal{N}(t,F,\|\cdot\|_2) \le \left(\frac{3}{t}\right)^{\dim(T)}, \qquad 0 < t \le 1 \qquad (5)$$

applies. Substituting (5) into Dudley's integral (2) yields

$$R_n(F;S^{\parallel}) \le 12 \int_0^\infty \sqrt{\frac{\log \mathcal{N}(t,F,\|\cdot\|_2)}{n}}\, dt
\le 12 \int_0^1 \sqrt{\frac{\dim(T)\log(3/t)}{n}}\, dt \le 17\sqrt{\frac{\dim(T)}{n}}.$$

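The second term in (4) is bounded via a standard calculation:

$$R_n(F;S^{\perp}) = \frac{1}{n}\,\mathbb{E}_\sigma\left[\sup_{\|w\|_2 \le 1} w \cdot \sum_{i=1}^n \sigma_i X_i^{\perp}\right]
= \frac{1}{n}\,\mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i X_i^{\perp} \right\|_2
\le \frac{1}{n} \left( \mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i X_i^{\perp} \right\|_2^2 \right)^{1/2}, \qquad (6)$$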

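where the second equality follows from the dual characterization of the $\ell_2$ norm and the inequality
is Jensen's. Now by independence of the Rademacher variables $\sigma_i$,

$$\mathbb{E}_\sigma \left\| \sum_{i=1}^n \sigma_i X_i^{\perp} \right\|_2^2
= \mathbb{E}_\sigma \sum_{1 \le i,j \le n} \sigma_i \sigma_j \left( X_i^{\perp} \cdot X_j^{\perp} \right)
= \sum_{i=1}^n \left\| X_i^{\perp} \right\|_2^2
= \sum_{i=1}^n \|P_T(X_i) - X_i\|_2^2 \le n\eta,$$

which implies $R_n(F;S^{\perp}) \le \sqrt{\eta/n}$ and, together with (4) and the bound on $R_n(F;S^{\parallel})$, proves the claim. □

Corollary 3.2. Let $(X_i,Y_i)$ be an iid sample of size $n$, where each $X_i \in \mathbb{R}^N$ satisfies $\|X_i\|_2 \le 1$.
Then, for any $\delta > 0$, with probability at least $1 - \delta$, for all $w \in \mathbb{R}^N$ with $\|w\|_2 \le 1$, and all
$k$-dimensional subspaces $T$ for which (3) holds, we have

$$\mathbb{E}[L(w \cdot X, Y)] \le \frac{1}{n} \sum_{i=1}^n L(w \cdot X_i, Y_i) + 34\sqrt{\frac{k}{n}} + 2\sqrt{\frac{\eta}{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}},$$

where $L(u,y) = |u|\,\mathbb{1}_{\{yu < 0\}}$ is the hinge loss.

² The hinge-loss is essentially the optimal convex surrogate loss [Ben-David et al., 2012].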

Proof. Follows from the Rademacher generalization bound in Theorem 2.3, the complexity estimate
in Theorem 3.1, and an application of Talagrand's contraction lemma (see [Ledoux and Talagrand,
1991]) to incorporate the hinge loss. □



Implicit in Corollary 3.2 is a tradeoff between dimensionality reduction and distortion.
Algorithmically, this tradeoff may be optimized using PCA. It suffices to compute the singular value
decomposition once, with runtime complexity $O(n^3 + Nn^2)$ [Golub and Van Loan, 1996]. Then for
each $1 \le k \le N$, we obtain the lowest-distortion $k$-dimensional subspace $T^{(k)}$, corresponding to
the top $k$ singular values. We then choose the value $1 \le k \le N$ which minimizes the
generalization bound of Corollary 3.2 and construct a low-dimensional linear classifier on the projected data
$(P_T(x_1),y_1),\ldots,(P_T(x_n),y_n)$, which is "lifted" to $\mathbb{R}^N$.
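
The selection step can be sketched as follows, assuming NumPy and a data matrix with one sample per row. The function name `select_pca_cutoff` and the parameters `c_dim` and `c_dist` are hypothetical; their defaults are the constants 34 and 2 from Corollary 3.2, exposed as parameters because the relative weighting of the two terms is something a practitioner may wish to adjust. The data is not centered, since $T$ ranges over linear subspaces through the origin.

```python
import numpy as np


def select_pca_cutoff(X: np.ndarray, c_dim: float = 34.0, c_dist: float = 2.0):
    """Pick the k that minimizes c_dim*sqrt(k/n) + c_dist*sqrt(eta_k/n).

    X is an (n x N) array with one sample per row, assumed uncentered.
    Returns (k, orthonormal basis of T^(k) as columns, eta per k, bound per k).
    """
    n = X.shape[0]
    # A single SVD suffices; the rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    energy = s ** 2
    # eta_k = (1/n) * (sum of squared singular values beyond the top k).
    tail = np.concatenate([np.cumsum(energy[::-1])[::-1][1:], [0.0]])
    eta = tail / n
    ks = np.arange(1, len(s) + 1)
    bound = c_dim * np.sqrt(ks / n) + c_dist * np.sqrt(eta / n)
    k = int(ks[int(np.argmin(bound))])
    return k, vt[:k].T, eta, bound
```

A linear classifier would then be trained on the projected coordinates `X @ basis`, and a learned weight vector `u` in the reduced space lifts to `basis @ u` in $\mathbb{R}^N$.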

Note that if $N$ is large, the data may be preprocessed using the Johnson-Lindenstrauss transform
[johnson and Lindenstraussl . 19821 . Ailon and Chazellel . 200a]. This random projection reduces the 

dimension to O (j^^^ while incurring a factor (1 -|- e) interpoint distortion. Taking e = 0{if), 

we achieve dimension N' = O(logn), so that the total cost of the PCA minimization is 0{{n^ + 
N) log n). 

As PCA is already employed heuristically as a denoising filtering step in the supervised
classification setting [Bi et al., 2003, Huang and Aviyente, 2007, Varshney and Willsky, 2011], Corollary 3.2
provides apparently the first rigorous theory for choosing the best cutoff for the PCA singular
values.



4 Adaptive Dimensionality Reduction: Metric case 



In this section, we extend the Euclidean approach of Section 3 to the general metric case. Suppose
$(\mathcal{X},\rho)$ is a metric space and we receive the training sample $(X_i,Y_i)$, $i = 1,\ldots,n$, with $X_i \in \mathcal{X}$
and $Y_i \in \{-1,1\}$. Following von Luxburg and Bousquet [2004] and Gottlieb et al. [2010], we will
construct a classifier via Lipschitz extension (realized as approximate nearest neighbors), but
with the added twist of a dimensionality reduction pre-processing step.

In Section 4.1 we formalize the notion of "nearly" low-dimensional data in a metric space and
discuss its implication for Rademacher complexity. Given $S = \{x_i\} \subset \mathcal{X}$, we say that $\tilde{S} = \{\tilde{x}_i\} \subset \mathcal{X}$
is an $(\eta, D)$-perturbation of $S$ if $\sum_{i=1}^n \rho(x_i,\tilde{x}_i) \le \eta$ and $\mathrm{ddim}(\tilde{S}) \le D$. If our data admits an
$(\eta, D)$-perturbation, we can prove that the Rademacher complexity it induces on Lipschitz functions
can be bounded in terms of $\eta$ and $D$ alone (Theorem 4.3), independently of the ambient dimension
$\mathrm{ddim}(\mathcal{X})$. As in the Euclidean case (Theorem 3.1), Rademacher estimates imply data-dependent
error bounds, stated in Theorem 4.4.

In Section 4.3 we describe how to convert our perturbation-based Rademacher bounds into
an effective classification procedure. To this end, we develop a novel bicriteria approximation
algorithm presented in Section 5. Informally, given a set $S \subset \mathcal{X}$ and a target doubling dimension
$D$, our method efficiently computes a set $\tilde{S}$ with $\mathrm{ddim}(\tilde{S}) \approx D$ and approximately minimal
distortion $\eta$. As a pre-processing step, we iterate the bicriteria algorithm to find a near-optimal
tradeoff between dimensionality and distortion. Having found a near-optimal $(\eta, D)$-perturbation
$\tilde{S}$, we employ the machinery developed in [Gottlieb et al., 2010] to exploit its low dimensionality
for fast approximate nearest-neighbor search.

4.1 Rademacher bounds 



We begin by obtaining complexity estimates for Lipschitz functions in doubling spaces. Gottlieb et al.
[2010] did this in terms of the fat-shattering dimension, but here we obtain considerably tighter
bounds by direct control over the covering numbers. The following "covering numbers by covering
numbers" lemma is a variant of the classic [Kolmogorov and Tihomirov, 1961] estimate:

Lemma 4.1. Let $F_L$ be the collection of $L$-Lipschitz functions mapping the metric space $(\mathcal{X},\rho)$ to
$[-1,1]$, and endow $F_L$ with the $\ell_\infty$ metric:

$$\|f - g\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - g(x)|, \qquad f,g \in F_L.$$

Then the covering numbers of $F_L$ may be estimated in terms of the covering numbers of $\mathcal{X}$:

$$\mathcal{N}(\varepsilon,F_L,\|\cdot\|_\infty) \le \left(\frac{8}{\varepsilon}\right)^{\mathcal{N}(\varepsilon/2L,\,\mathcal{X},\,\rho)}.$$

Hence, for doubling spaces with $\mathrm{diam}(\mathcal{X}) = 1$,

$$\log \mathcal{N}(\varepsilon,F_L,\|\cdot\|_\infty) \le \left(\frac{4L}{\varepsilon}\right)^{\mathrm{ddim}(\mathcal{X})} \log\frac{8}{\varepsilon}.$$

Proof. Fix a covering of $\mathcal{X}$ using balls of radius $\varepsilon' = \varepsilon/2L$, whose centers are given by a set $N$ of
cardinality $|N| \le \mathcal{N}(\varepsilon/2L,\mathcal{X},\rho)$. Define a set $\hat{F}$ of functions $\hat{f}$ as follows. At every point in $N$, set
the value of $\hat{f}$ to be some multiple of $\varepsilon' L/2 = \varepsilon/4$, but keeping only functions $\hat{f}$ whose Lipschitz
constant is at most $2L$; now use the classic Lipschitz extension theorem for real-valued functions,
due to [McShane, 1934] and [Whitney, 1934], to extend the domain from $N$ to all of $\mathcal{X}$.



We claim that every $f \in F_L$ is close to some $\hat{f} \in \hat{F}$, in the sense that $\|f - \hat{f}\|_\infty \le \varepsilon$. Indeed,
every point $x \in \mathcal{X}$ is $\varepsilon'$-close to some point $x_N \in N$, and since $f$ is $L$-Lipschitz and $\hat{f}$ is $2L$-Lipschitz,

$$|f(x) - \hat{f}(x)| \le |f(x) - f(x_N)| + |f(x_N) - \hat{f}(x_N)| + |\hat{f}(x_N) - \hat{f}(x)|
\le L \cdot \rho(x,x_N) + \varepsilon/4 + 2L \cdot \rho(x,x_N) = \varepsilon.$$

It is easy to verify that $|\hat{F}| \le (8/\varepsilon)^{|N|}$, since by construction, functions $\hat{f}$ are determined by
their values on $N$. This provides a covering of $F_L$ using $|\hat{F}|$ balls of radius $\varepsilon$. The bound for
doubling spaces follows immediately via Lemma 2.2. □

Equipped with the covering numbers estimate, we proceed to bound the Rademacher complexity
of Lipschitz functions on doubling spaces.³

Theorem 4.2. Let $F_L$ be the collection of $L$-Lipschitz $[-1,1]$-valued functions defined on a metric
space $(S,\rho)$ with diameter 1 and doubling dimension $D$. Then

$$R_n(F_L;S) = O\left(\frac{L}{n^{1/(D+1)}}\right).$$

Proof. Recall that $\|f\|_2 \le \|f\|_\infty$ implies $\mathcal{N}(\varepsilon,F,\|\cdot\|_2) \le \mathcal{N}(\varepsilon,F,\|\cdot\|_\infty)$. Substituting the estimate
in Lemma 4.1 into Dudley's integral (2), we have

Rn{FL;S) < inf ( 4a + 12 J ^^ihlhMoAdt 



a>0 \ Irv \ n 



1 (iL\D ^ 



< inf I 4a + 12 / \l ^^H-^^^^ijldt 



< inf I 4a + 12 / \l ^ ' dt 



= {{D -1)K)T^ +K (js^T^iiD -1)K)T^ 

< 8K~^^ + DK (k^^ -1) =0{KT^), 



where 



34(4L)^/2 /D-1 
K = — 



n 



Thus, $R_n(F_L;S) = O\left(L / n^{1/(D+1)}\right)$, as claimed. □



This bound essentially matches the rate for $(\mathcal{X},\rho) = ([0,1]^D, \|\cdot\|_2)$ [von Luxburg and Bousquet,
2004]. Finally, we quantify the savings earned by a low-distortion dimensionality reduction.

³ Analogous bounds were obtained by von Luxburg and Bousquet [2004] in less explicit form.

Theorem 4.3. Let $(\mathcal{X},\rho)$ be a metric space with diameter 1, and consider the $n$-point sets $S,\tilde{S} \subset \mathcal{X}$,
where $\tilde{S}$ is an $(\eta n^{D/(D+1)}, D)$-perturbation of $S$. Let $F_L$ be the collection of all $L$-Lipschitz, $[-1,1]$-valued
functions on $\mathcal{X}$. Then

$$R_n(F_L;S) = O\left(\frac{L(1+\eta)}{n^{1/(D+1)}}\right).$$

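Proof. For $X_i \in S$ and $\tilde{X}_i \in \tilde{S}$, define $\delta_i(f) = f(X_i) - f(\tilde{X}_i)$. Then

$$R_n(F_L;S) = \mathbb{E} \sup_{f \in F_L} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)
= \mathbb{E} \sup_{f \in F_L} \frac{1}{n} \sum_{i=1}^n \sigma_i \left( f(\tilde{X}_i) + \delta_i(f) \right)
\le R_n(F_L;\tilde{S}) + \mathbb{E} \sup_{f \in F_L} \frac{1}{n} \sum_{i=1}^n \sigma_i \delta_i(f). \qquad (7)$$

Now the Lipschitz property and our definition of perturbation imply that

$$\left| \sum_{i=1}^n \sigma_i \delta_i(f) \right| \le \sum_{i=1}^n |\delta_i(f)| \le L \sum_{i=1}^n \rho(X_i,\tilde{X}_i) \le L \eta n^{D/(D+1)}$$

and hence

$$\mathbb{E} \sup_{f \in F_L} \frac{1}{n} \sum_{i=1}^n \sigma_i \delta_i(f) \le \frac{L\eta}{n^{1/(D+1)}}.$$

The other term in (7) is bounded by invoking Theorem 4.2. □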



4.2 Generalization bounds 

For $f : \mathcal{X} \to [-1,1]$, define the margin of $f$ on the labeled example $(x,y)$ by $yf(x)$. The
$\gamma$-margin loss, $0 < \gamma < 1$, that $f$ incurs on $(x,y)$ is $L_\gamma(f(x),y) = \min(\max(0, 1 - yf(x)/\gamma), 1)$,
which charges a value of 1 for predicting the wrong sign, charges nothing for predicting correctly
with confidence $yf(x) \ge \gamma$, and for $0 < yf(x) < \gamma$ linearly interpolates between 0 and 1. Since
$L_\gamma(f(x),y) \ge \mathbb{1}_{\{\mathrm{sgn}(f(x)) \ne y\}}$, the sample margin loss upper-bounds the sample misclassification error.
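
The three regimes of the margin loss are easy to check numerically. The helper below is a direct transcription of the formula; the name `margin_loss` and the argument order are assumptions for illustration.

```python
def margin_loss(fx: float, y: int, gamma: float) -> float:
    """gamma-margin loss L_gamma(f(x), y) = min(max(0, 1 - y*f(x)/gamma), 1)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # The inner max clips confident correct predictions to 0; the outer min
    # caps the charge for wrong-sign predictions at 1.
    return min(max(0.0, 1.0 - y * fx / gamma), 1.0)
```

For instance, with $\gamma = 0.5$ a prediction $f(x) = 0.25$ on a positive example has margin halfway to $\gamma$ and is charged 0.5.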

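Theorem 4.4. Let $F_L$ be the collection of $L$-Lipschitz functions mapping the metric space $\mathcal{X}$
with $\mathrm{diam}(\mathcal{X}) = 1$ to $[-1,1]$. If the iid sample $(X_i,Y_i) \in \mathcal{X} \times \{-1,1\}$, $i = 1,\ldots,n$, admits an
$(\eta n^{D/(D+1)}, D)$-perturbation, then for any $\delta > 0$, with probability at least $1 - \delta$, the following holds
for all $f \in F_L$ and all $\gamma \in (0,1)$:

$$P(\mathrm{sgn}(f(X)) \ne Y) \le \frac{1}{n} \sum_{i=1}^n L_\gamma(f(X_i),Y_i) + O\left(\frac{L(1+\eta)}{\gamma\, n^{1/(D+1)}}\right) + \sqrt{\frac{\log\log_2(2/\gamma)}{n}} + 3\sqrt{\frac{\log(4/\delta)}{2n}}.$$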

Proof. We invoke Theorem 2.4 to bound the classification error in terms of sample margin loss and
Rademacher complexity, and the latter is bounded via Theorem 4.3. □

4.3 Classification procedure

Theorem 4.4 provides a statistical optimality criterion for the dimensionality-distortion tradeoff.⁴
Unlike the Euclidean case, where a simple PCA optimized this tradeoff, the metric case requires
a novel bicriteria approximation algorithm, described in Section 5. Informally, given a set $S \subset \mathcal{X}$
and a target doubling dimension $D$, our method efficiently computes a set $\tilde{S}$ with $\mathrm{ddim}(\tilde{S}) \approx D$,
which approximately minimizes the distortion $\eta$. We may iterate this algorithm over all $D \in
\{1,\ldots,\log_2 |S|\}$ (since the doubling dimension of the metric space $(S,\rho)$ is at most $\log_2 |S|$) to
optimize the complexity⁵ term in Theorem 4.4.

Once a nearly optimal $(\eta n^{D/(D+1)}, D)$-perturbation $\tilde{S}$ has been computed, we predict the value
at a test point $x \in \mathcal{X}$ by a thresholded Lipschitz extension from $\tilde{S}$, which algorithmically amounts
to an approximate nearest-neighbor classifier. The efficient implementation of this method (as
well as technicalities stemming from its approximate nature) are discussed in [Gottlieb et al., 2010].
Their algorithm computes an $\varepsilon$-approximate Lipschitz extension in preprocessing time $2^{O(D)} n \log n$
and test-point evaluation time $2^{O(D)} \log n + \varepsilon^{-O(D)}$. The latter also allows one to efficiently decide
on which sample points (if any) the classifier should be allowed to err, with corresponding savings
in the Lipschitz constant⁶ (and hence lower complexity).
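
To make the prediction rule concrete, here is a minimal sketch of a thresholded Lipschitz extension, assuming an exact linear scan over the sample in place of the approximate nearest-neighbor search of Gottlieb et al. The function names and the averaging of the largest and smallest $L$-Lipschitz extensions are illustrative choices, not a transcription of their algorithm; the labels are assumed to be $L$-Lipschitz on the sample.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def lipschitz_extension(
    x: Point,
    sample: Sequence[Point],
    labels: Sequence[int],
    L: float,
    dist: Callable[[Point, Point], float] = math.dist,
) -> float:
    """Average of the largest and smallest L-Lipschitz extensions of the labels."""
    upper = min(y + L * dist(x, xi) for xi, y in zip(sample, labels))
    lower = max(y - L * dist(x, xi) for xi, y in zip(sample, labels))
    return 0.5 * (upper + lower)


def classify(
    x: Point,
    sample: Sequence[Point],
    labels: Sequence[int],
    L: float,
    dist: Callable[[Point, Point], float] = math.dist,
) -> int:
    """Threshold the extension at 0, using the convention sgn(0) = +1."""
    return 1 if lipschitz_extension(x, sample, labels, L, dist) >= 0.0 else -1
```

With two sample points at 0 and 1 labeled $-1$ and $+1$ and $L = 2$, the extension interpolates linearly between them, so the decision boundary sits at the midpoint, as a nearest-neighbor rule would place it.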






5 Bicriteria approximation of intrinsic dimension and the perturbation

For a point $v$ and set $T$, define $d(v,T) = \min_{w \in T} d(v,w)$. Given two point sets $S,T$, let the cost of
mapping $S$ to $T$ be $\sum_{v \in S} d(v,T)$. Define the low-dimensional mapping problem as follows: Given a
point set $S$ and a target dimension $d$, find $T \subset S$ with $\mathrm{ddim}(T) \le d$, so that the cost of mapping
$S$ to $T$ is minimized. As mentioned before, an algorithm for this problem finds a near-optimal
$(\eta, D)$-perturbation.
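
The objective of the low-dimensional mapping problem can be evaluated directly from these definitions. The sketch below uses hypothetical function names and a brute-force scan; it computes only the cost of a given candidate $T$ and does not search for one.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def point_to_set_distance(
    v: Point,
    T: Sequence[Point],
    dist: Callable[[Point, Point], float] = math.dist,
) -> float:
    """d(v, T) = min over w in T of d(v, w)."""
    return min(dist(v, w) for w in T)


def mapping_cost(
    S: Sequence[Point],
    T: Sequence[Point],
    dist: Callable[[Point, Point], float] = math.dist,
) -> float:
    """Cost of mapping S to T: the sum over v in S of d(v, T)."""
    return sum(point_to_set_distance(v, T, dist) for v in S)
```

Mapping each point of $S$ to its nearest point of $T$ yields a perturbation whose total displacement equals this cost.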

An $(\alpha,\beta)$-bicriteria approximate solution to the low-dimensional mapping problem is a subset
$T' \subset S$, such that the cost of mapping $S$ to $T'$ is at most $\alpha$ times the cost of mapping $S$ to an
optimal $T$ (of $\mathrm{ddim}(T) \le d$), and also $\mathrm{ddim}(T') \le \beta d$. We prove the following theorem.

Theorem 5.1. An $(O(1), O(1))$-bicriteria approximate solution for the low-dimensional mapping
problem on input S of size n = \S\ can be computed in time 2'-' ^'^^^^^ n -\- 0{n log^ n) . 

In presenting the algorithm, we first give in Section 5.1 an integer program (IP) that models
this problem, and show that an exact solution to the integer program gives a good bicriteria
approximation to the minimum mapping cost problem (Lemma 5.2). However, finding an exact
solution to the IP seems difficult; we thus relax in Section 5.2 some of the IP constraints, and derive
a linear program (LP) that can be solved in the runtime stated above (Lemma 5.4). Further, we
give a rounding scheme that recovers an integral solution from the LP solution, and then show in
Lemma 5.3 that the integral solution matches the bicriteria approximation of Theorem 5.1, thereby
completing its proof.

⁴ Although the estimate in Theorem 4.3 was given as $O(L(1+\eta)/n^{1/(D+1)})$ for readability, its proof yields
explicit, easily computable bounds.

⁵ Since $L/\gamma$ multiplies $(1+\eta)/n^{1/(D+1)}$ in the error bound, the optimization may be carried out oblivious to $L$
and $\gamma$.

⁶ Note that the complexity term in Theorem 4.4 scales as $L/\gamma$ and hence the final classifier can always be normalized
to have Lipschitz constant 1, so no further stratification over $L$ is necessary. We do, however, need to stratify over
the doubling dimension $D$ (see [Shawe-Taylor et al., 1998]).

Remark. In what follows, we present a solution with very large (though constant) approximation
factors. The techniques presented below can yield much tighter bounds, if we only choose to create
a series of many hierarchies instead of a single one. We have chosen the current presentation for
simplicity.

5.1 An integer program 

The integer program below encapsulates an approximate solution to the low-dimensional mapping
problem, and will motivate the introduction of a linear program in Section 5.2. Denote the input
by $S$ and $d$, and let $\mathcal{S}$ be a hierarchy for $S$ with a 1-covering property. We assume that the
minimum interpoint distance in $S$ is $\eta$, so that the hierarchy possesses $t = \lceil \log_2(1/\eta) \rceil$ levels. (This
assumption may be imposed on a set by incurring an additive per-point distortion of $\eta$.) A solution
to the integer program will yield a set $T$ with a hierarchy $\mathcal{T} \subseteq \mathcal{S}$. We introduce a set $Z$ of 0-1
variables for the hierarchies; variable $z^i_j \in Z$ corresponds to a point $v_j \in S_i$, and is an indicator
for $v_j$ being present in $T_i \in \mathcal{T}$. Clearly $|Z| \le nt$, where $t$ is the number of levels in the hierarchy.
When convenient, we may refer to the distance between variables where we mean the distance between
their corresponding points.

We further introduce a set $C$ of $n$ variables $c_j$, where each variable $c_j$ will approximate the
point mapping cost $d(v_j, T)$.

Now, we wish to define the $i$-level neighborhood of point $v_j$ to be the net-points of $S_i$ which are
relatively close to $v_j$. More formally, when $v_j \in S_i$, let $E^i_j \subseteq Z$ include all variables $z^i_k$ for which
$d(v_j, v_k) \le e \cdot 2^{-i}$, for $e := 6$. If $v_j \notin S_i$, then let $w \in S_i$ be the nearest neighbor of $v_j$ in $S_i$,
and define $E^i_j \subseteq Z$ to include all variables $z^i_k$ for which $d(w, v_k) \le e \cdot 2^{-i}$. We will similarly define
two more neighbor sets: $F^i_j \subseteq Z$ includes all variables $z^i_k$ for which $d(w, v_k) \le f \cdot 2^{-i}$, for $f := 16$, and
$G^i_j \subseteq Z$ includes all variables $z^i_k$ for which $d(w, v_k) \le g \cdot 2^{-i}$, for $g := 150$. The integer program
follows.
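\begin{align}
\min \quad & \sum_{j} c_j \notag \\
\text{subject to} \quad & z^i_j \in \{0, 1\} && \forall z^i_j \in Z \tag{8} \\
& z^{i+1}_j \ge z^i_j && \forall z^i_j, z^{i+1}_j \in Z \tag{9} \\
& \sum_{z \in E^i_j} z \ge z^t_j && \forall i,\ v_j \in S \tag{10} \\
& \sum_{z \in F^i_j} z \le f^{\log(2^{3d}+1)} && \forall i,\ v_j \in S \tag{11} \\
& \sum_{z \in G^i_j} z \le g^{\log(2^{3d}+1)} && \forall i,\ v_j \in S \tag{12} \\
& z^t_j + \sum_{z \in F^i_j} z + \frac{c_j}{2^{-i}} \ge 1 && \forall i,\ v_j \in S \tag{13} \\
& z^t_j + \frac{c_j}{\eta} \ge 1 && \forall v_j \in S \tag{14} \\
& \sum_{z \in F^i_j} z \ge \frac{1}{f^{\log(2^{3d}+1)}} \sum_{z \in F^k_j} z && \forall i < k,\ v_j \in S \tag{15}
\end{align}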
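The neighbor sets $E^i_j$, $F^i_j$ and $G^i_j$ that appear in the program are simple to compute once a level-$i$ net is available. The following is a minimal sketch, assuming Euclidean points and a net given as a set of indices; the function name `neighborhood` and the toy data are illustrative and not part of the paper.

```python
from math import dist

E, F, G = 6, 16, 150  # radius factors e, f, g as given in the text


def neighborhood(points, net, j, i, factor):
    """Indices k of net points with d(center, v_k) <= factor * 2**-i.

    `net` holds the indices of the level-i net S_i. The center is v_j when
    j is a net point, and otherwise the net point nearest to v_j.
    """
    center = j if j in net else min(net, key=lambda k: dist(points[j], points[k]))
    radius = factor * 2.0 ** (-i)
    return {k for k in net if dist(points[center], points[k]) <= radius}


# Toy data (hypothetical): five points on a line, three of them net points.
points = [(0.0,), (0.1,), (0.5,), (3.5,), (9.0,)]
net = {0, 3, 4}
print(neighborhood(points, net, 1, 0, E))  # v_1 is not a net point; its center is v_0
```

This brute-force scan takes time linear in the net size per query; it is meant only to make the definitions concrete, not to match the runtime of the hierarchy construction cited in the text.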



We prove that a solution to this IP gives a hierarchy providing a bicriteria approximate solution
to the low-dimensional mapping problem. (As an aside, we note that constraints (11) and (15) are
not necessary for the purposes of the following lemma, but will later play a central role in the proof
of Lemma 5.3.) Let $T^*$ be an optimal solution for the low-dimensional mapping problem on input
$(S, d)$, and let $C^*$ be the cost of mapping $S$ to $T^*$.

Lemma 5.2. A feasible solution to the integer program implies a set $T$ with doubling dimension
$\mathrm{ddim}(T) \le (3 \log 150) d + o(1)$. The cost of mapping $S$ to $T$ is at most $64 C^*$.

Proof. We first show that a solution to the IP yields a hierarchy with favorable properties, similar
to the guarantees of Lemma 2.1. Constraint (9) enforces the nested property of the hierarchy: If
$v_j$ appears in $T_i$, then $v_j$ will appear also in $T_{i+1}$. Constraint (10) enforces a covering property: If
$v_j$ appears in $T$, there exists in each level $T_i$ a point which 8-covers $v_j$. (Note that the hierarchy
of Lemma 2.1 possesses 4-covering points. By construction, $E^i_j$ contains all points of $S_i$ within
distance $e \cdot 2^{-i} - 2 \cdot 2^{-i} = 4 \cdot 2^{-i}$ of $v_j$, that is, all 4-covering points of $v_j$, along with other points
that are at distance at most $2 \cdot 2^{-i} + e \cdot 2^{-i} = 8 \cdot 2^{-i}$ from $v_j$.) Constraints (11) and (12) enforce
a packing property for the hierarchy, stipulating that any ball of radius $f \cdot 2^{-i}$ (or $g \cdot 2^{-i}$) may
contain at most $f^{3d+o(1)}$ (respectively $g^{3d+o(1)}$) points with minimum interpoint distance $2^{-i}$. By Lemma 2.1,
there exists a hierarchy for $T$ which is a subset of $\mathcal{S}$ and has doubling dimension at most $3d + o(1)$,
so there exists a hierarchy which satisfies all constraints.

(As an aside, note that the 8-covering property already implies that condition (15) is satisfied:
By condition (11), $F^k_j$ contains at most $f^{\log(2^{3d}+1)}$ points, so condition (15) implies only that
$\sum_{z \in F^i_j} z \ge 1$ whenever $F^k_j$ contains at least one non-zero variable. Further, if $F^k_j$ contains any
non-zero variable, then that variable is at distance at most $f \cdot 2^{-k}$ from $v_j$, and by the 8-covering
property there must exist a covering variable in $T_i$ set to 1, within distance $f \cdot 2^{-k} + 8 \cdot 2^{-i} \le f \cdot 2^{-i}$
of $v_j$. This variable is necessarily found in $F^i_j$, so indeed $\sum_{z \in F^i_j} z \ge 1$.)

To calculate the doubling dimension of $T$, recall that every point $v_j \in T$ is within distance
$8 \cdot 2^{-i}$ of some point in $T_i$. Hence, all points of $T$ covered by a ball of radius $16 \cdot 2^{-i}$ centered at
some point $v \in T_i$ are covered by balls of radius $8 \cdot 2^{-i}$ centered at net-points of $T_i$ within distance
$(16 + 8) \cdot 2^{-i} = 24 \cdot 2^{-i}$ of $v$. By constraint (12), there are at most $g^{3d+o(1)}$ net-points of $T_i$ within
this distance of $v$, implying a doubling dimension of $\log(2^{3d} + 1) \log g = (3 \log 150) d + o(1)$.

Constraint (13) calculates a penalty $c_j$ for point $v_j$ not included in $T$: If $v_j \notin T$, then $z^t_j = 0$,
and if also $\sum_{z \in F^i_j} z = 0$, then the penalty $c_j$ must be set equal to at least $2^{-i}$. We will show that
$c_j$ is within a constant factor of the true penalty $d(v_j, T)$. In what follows, let $v_k$ be the closest
neighbor to $v_j$ in $T$, and let $2^{-p} \le d(v_j, v_k) < 2^{-(p-1)}$.

First, we must show that $c_j$ is not too large. Indeed, whenever $i < p$, $v_j$ will not incur the penalty
of constraint (13): $v_k$ is 8-covered by some point $w \in T_i$, and so $d(v_j, w) \le d(v_j, v_k) + d(v_k, w) <
2^{-p+1} + 8 \cdot 2^{-i} \le 9 \cdot 2^{-i}$. Now, the distance from $v_j$ to the closest point in $S_i$ is less than $2 \cdot 2^{-i}$, so
$w$ is within distance $9 \cdot 2^{-i} + 2 \cdot 2^{-i} = 11 \cdot 2^{-i} < f \cdot 2^{-i}$ of the center point of $F^i_j$, so $w$'s variable is
included in $F^i_j$. It follows that $\sum_{z \in F^i_j} z \ge 1$, and so $c_j$ does not incur a penalty from level $i$.

Next, we must show that $c_j$ is not too small. We will show that whenever $i \ge p + 5$, $v_j$ incurs the
penalty of constraint (13). The distance from $v_j$ to any point of $F^i_j$ is at most $2 \cdot 2^{-i} + f \cdot 2^{-i} = 18 \cdot 2^{-i}$.
Since the distance from $v_j$ to $v_k$ is at least $2^{-p} \ge 32 \cdot 2^{-i} > 18 \cdot 2^{-i}$, no point of $F^i_j$ is contained
in $T$. It follows that $\sum_{z \in F^i_j} z = 0$, and so $c_j$ must be set equal to at least $2^{-i}$. (We note that when
$p > t - 4$, constraint (14) ensures a minimal penalty of $\eta$.)

It follows that the value of $c_j$ is determined by some level $p \le i \le p + 5$, and is therefore in the
range $[\frac{1}{32} \cdot 2^{-p}, 2^{-p}]$. Since the correct penalty is less than $2^{-p+1}$, the assessed penalty is within
factor 64 of the correct penalty. □
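The factor 64 can be checked with exact arithmetic. The sketch below assumes only the two bounds used in the proof, namely $d(v_j, T) < 2^{-(p-1)}$ and $c_j \ge \frac{1}{32} \cdot 2^{-p}$; the function name is illustrative.

```python
from fractions import Fraction


def lemma_5_2_ratio(p):
    """Worst-case ratio between the true distance and the assessed penalty."""
    distance_upper = Fraction(2) / 2**p       # d(v_j, T) < 2^-(p-1) = 2 * 2^-p
    penalty_lower = Fraction(1, 32) / 2**p    # c_j >= (1/32) * 2^-p
    return distance_upper / penalty_lower


# The ratio does not depend on p.
print([lemma_5_2_ratio(p) for p in range(4)])
```

The scale $2^{-p}$ cancels, which is why a single constant bounds the ratio at every level.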

5.2 A linear program 

While the integer program gives a good approximation to the low-dimensional mapping problem,
the IP itself seems difficult to solve. Instead, we create an LP by relaxing the integrality constraints
(8) into linear constraints $z^i_j \in [0, 1]$. Of course, any integer solution satisfying the IP satisfies the
LP as well. In Section 5.3 we will show that this LP can be solved quickly.

After solving the LP, we recover a solution to the low-dimensional mapping problem by rounding 
the variables to integers, as follows: 

1. If $z^t_j \ge \frac{1}{2}$, then $z^t_j$ is rounded up to 1.

2. For each level $i = 0, \ldots, t$: Let $\mathcal{F}^i$ be the set of all neighborhoods $F^i_j$. Extract from $\mathcal{F}^i$ a
maximal subset $\hat{\mathcal{F}}^i$ whose elements obey the following: (i) For each $F^i_j \in \hat{\mathcal{F}}^i$ there is some
$k \ge i$ such that $\sum_{z \in F^k_j} z \ge \frac{1}{4}$. (ii) Elements of $\hat{\mathcal{F}}^i$ do not intersect. (iii) The distance from
$F^i_j \in \hat{\mathcal{F}}^i$ to any variable already rounded to 1 is at least $e \cdot 2^{-i}$.

For each element $F^i_j \in \hat{\mathcal{F}}^i$, we round up $z^i_j$ to 1. Likewise, every variable $z^k_j$ with $k > i$ is
rounded to 1 as well.

3. All other variables of $Z$ are rounded down.
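Two ingredients of the rounding can be sketched in a few lines: the threshold test of step 1, and a greedy pass that extracts a maximal family of pairwise disjoint neighborhoods as in step 2. This is a simplified illustration under stated assumptions: neighborhoods are given as plain sets of variable ids, the `mass` of a neighborhood stands in for condition (i), and the distance condition (iii) is omitted. All names and data are hypothetical.

```python
def round_bottom_level(lp_values, threshold=0.5):
    """Step 1: bottom-level variables with LP value >= 1/2 are rounded up."""
    return {var for var, value in lp_values.items() if value >= threshold}


def maximal_disjoint_family(neighborhoods, mass, min_mass=0.25):
    """Greedy pass for step 2, conditions (i) and (ii) only.

    A neighborhood is kept if its mass reaches min_mass and it shares no
    variable with a neighborhood kept earlier. The result is maximal:
    every rejected eligible neighborhood intersects a kept one.
    """
    chosen, used = [], set()
    for name, members in neighborhoods.items():
        if mass[name] >= min_mass and used.isdisjoint(members):
            chosen.append(name)
            used |= members
    return chosen


# Hypothetical LP output.
lp_values = {"z1": 0.5, "z2": 0.49, "z3": 1.0}
neighborhoods = {"F1": {1, 2}, "F2": {2, 3}, "F3": {4}, "F4": {5}}
mass = {"F1": 0.3, "F2": 0.9, "F3": 0.1, "F4": 0.25}

print(round_bottom_level(lp_values))
print(maximal_disjoint_family(neighborhoods, mass))
```

The greedy pass returns a maximal family, not necessarily a largest one, which is all that step 2 asks for.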

The rounding steps yield a set $T$ with hierarchy $\mathcal{T}$. The following lemma completes the first
half of Theorem 5.1.

Lemma 5.3. $T$ is a $(464, 6 \log_2 150 + o(1))$-bicriteria approximate solution to the low-dimensional
mapping problem on $S$.

Proof. We enumerate three properties of the hierarchy T. 

(i) Nested: When a variable of level $i$ is rounded up in rounding step (2), all corresponding variables
in levels $k > i$ are also rounded up. This implies that $\mathcal{T}$ is nested.

(ii) Packing: Take a ball of radius $g \cdot 2^{-i}$ centered at any point $v \in S_i$. By constraint (12), the sum
of variables corresponding to points in this ball is at most $g^{3d+o(1)}$. When $i = t$, step (1) rounds
up only variables of value $\frac{1}{2}$ and higher, and so the ball may contain at most $2 \cdot g^{3d+o(1)}$ points of
$S_t$. When $i < t$, note that condition (15) implies that $\sum_{z \in F^i_j} z \ge \frac{1}{4 f^{\log(2^{3d}+1)}}$, since $\sum_{z \in F^k_j} z \ge \frac{1}{4}$
for some $k$. Hence, rounding up a variable of $F^i_j$ places at most $4 g^{3d+3d+o(1)} = 4 g^{6d+o(1)}$ additional
points of $S_i$ into the ball. Now, rounding step (2) may add to this ball points from levels $k < i$,
but since points in each level $k$ obey a packing property $2^{-k}$, and the radius of our ball is at most
$g \cdot 2^{-i}$, each level can add at most a constant number of points to the ball, for a total of $O(t)$ points.
It follows that the total number of points in the ball is less than $5 g^{6d} + O(1)$.

(iii) Covering: We first show that any variable rounded to 1 in rounding step (1) is 50-covered
in each hierarchical level $T_i$. Since $z^t_j \ge \frac{1}{2}$, condition (10) gives that $\sum_{z \in F^i_j} z \ge \sum_{z \in E^i_j} z \ge \frac{1}{2}$ in
the LP solution. Hence, by construction of the rounding step (2), either $z^i_j$ will be rounded to 1,
or a variable within distance $3f \cdot 2^{-i}$ of the center of $F^i_j$ is rounded to 1. The claim follows from
recalling that the distance from $v_j$ to the center of $F^i_j$ is at most $2 \cdot 2^{-i}$.

We then show that every variable $z^i_j$ rounded to 1 in rounding step (2) is 48-covered in each
hierarchical level $k < i$. By condition (15), since $z^i_j$ was chosen to be rounded, rounding step (2)
for level $k$ assures that either $z^k_j$ or a variable within distance $3f \cdot 2^{-k}$ will appear in $T_k$. This point
48-covers $z^i_j$.

Having enumerated the properties of the hierarchy, we can now prove the doubling dimension
of $T$. Take any ball $B$ of radius $100 \cdot 2^{-i}$ centered at a point of $T_i$. Since every point of $T_k$ is
50-covered by some point in $T_i$, the points of $T_k$ covered by $B$ are all covered by a set of balls of
radius $50 \cdot 2^{-i}$ within distance $150 \cdot 2^{-i} = g \cdot 2^{-i}$ of the center point. By the packing property proved
above, there exist at most $5 g^{6d} + O(1)$ such points, implying a doubling dimension of $6d \log 150 + O(1)$.

It remains only to bound the mapping cost. Recall that the cost of the optimal solution of the
LP is not greater than the cost of the optimal solution for the IP, which is within a factor 64
of the optimal solution. Consider the cost $c_j$ assigned to the variable $z^t_j$. If after the rounding
$d(v_j, T) \le c_j$, then we have only gained from rounding this value. Hence, let us take a variable
$z^t_j$ which is less than $\frac{1}{2}$ in the LP solution, and was subsequently rounded down. We must only
prove that $c_j$ is not much less than $d(v_j, T)$. First note that by condition (14), $c_j \ge \frac{\eta}{2}$. Now, take
the highest level $i$ for which $c_j < \frac{2^{-i}}{4}$; by condition (13), it must be that $\sum_{z \in F^i_j} z \ge \frac{1}{4}$. Then by
rounding step (2), a variable within distance $2 \cdot 2^{-i} + 3f \cdot 2^{-i} = 50 \cdot 2^{-i}$ of $v_j$ must be rounded.
Hence, the assigned penalty $c_j \ge \frac{2^{-(i+1)}}{4} = \frac{2^{-i}}{8}$ is within factor 400 of the true penalty. Joining
together the analyses for the two cases where $c_j$ is larger or smaller than the true cost, we achieve an
approximation of $64 + 400 = 464$ to the true cost. □
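The arithmetic behind the factor 400 can be verified exactly. The sketch below assumes that constraint (13) has the form $z^t_j + \sum_{z \in F^i_j} z + c_j / 2^{-i} \ge 1$, which is a reading of the garbled original rather than a confirmed statement, and the function names are illustrative.

```python
from fractions import Fraction


def forced_neighborhood_mass(z_t, c_j, i):
    """Least neighborhood sum allowed by z_t + sum + c_j / 2^-i >= 1."""
    return 1 - z_t - c_j * 2**i


def lemma_5_3_ratio(i):
    """Worst-case ratio between the true distance and the assessed penalty."""
    distance_upper = 50 * Fraction(1, 2**i)        # 2*2^-i + 3f*2^-i with f = 16
    penalty_lower = Fraction(1, 2 ** (i + 1)) / 4  # c_j >= 2^-(i+1) / 4
    return distance_upper / penalty_lower


# At the boundary z_t = 1/2 and c_j = 2^-i / 4 the neighborhood sum must reach 1/4.
print(forced_neighborhood_mass(Fraction(1, 2), Fraction(1, 4) / 2**3, 3))
print(lemma_5_3_ratio(3), 64 + lemma_5_3_ratio(3))
```

With strict inequalities $z^t_j < \frac{1}{2}$ and $c_j < \frac{2^{-i}}{4}$, the forced sum strictly exceeds $\frac{1}{4}$, which is what triggers rounding step (2).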



5.3 LP solver 



To solve the linear program, we utilize the framework presented by [Young, 2001] for LPs of the following
form: Given non-negative matrices $P$, $C$, vectors $p$, $c$ and precision $\beta > 0$, find a non-negative vector
$x$ such that $Px \le p$ (LP packing constraint) and $Cx \ge c$ (LP covering constraint). Young shows
that if there exists a feasible solution to the input instance, then a solution to a relaxation of the
input program, specifically $Px \le (1 + \beta) p$ and $Cx \ge c$, can be found in time $O(mr (\log m) / \beta^2)$,
where $m$ is the number of constraints in the program and $r$ is the maximum number of constraints
in which a single variable may appear. We will show how to model our LP in a way consistent with
Young's framework, and obtain an algorithm that achieves the approximation bounds of Lemma
5.3 with the runtime claimed by Theorem 5.1. Lemma 5.4 below completes the proof of Theorem
5.1.
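To make the relaxed guarantee concrete, the sketch below checks whether a candidate vector satisfies $Px \le (1 + \beta) p$ and $Cx \ge c$ with $x \ge 0$. It is only a feasibility checker written for illustration, not Young's algorithm; the function name and the toy instance are assumptions.

```python
def is_relaxed_feasible(P, p, C, c, x, beta):
    """Check x >= 0, P x <= (1 + beta) p and C x >= c, row by row."""

    def dot(row):
        return sum(a * b for a, b in zip(row, x))

    nonnegative = all(value >= 0 for value in x)
    packing_ok = all(dot(row) <= (1 + beta) * bound for row, bound in zip(P, p))
    covering_ok = all(dot(row) >= bound for row, bound in zip(C, c))
    return nonnegative and packing_ok and covering_ok


# Toy instance (hypothetical): one packing row and two covering rows.
P, p = [[1, 1]], [1]
C, c = [[1, 0], [0, 1]], [0.5, 0.55]
x = [0.5, 0.55]

print(is_relaxed_feasible(P, p, C, c, x, beta=0.1))
print(is_relaxed_feasible(P, p, C, c, x, beta=0.01))
```

In the toy instance the exact program is infeasible, since the covering rows force $x_1 + x_2 \ge 1.05$, yet the relaxation with $\beta = 0.1$ admits a solution. This is the slack that the choice of $\beta$ must keep under control.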



Lemma 5.4. An algorithm realizing the bounds of Lemma \5.3\ can be computed in time 2^^'^^^^^^^^n+ 
0{nlog^ n). 

Proof. (Sketch.) To define the LP, we must first create a hierarchy for $S$, which can be done in time
$\min\{O(tn^2),\ 2^{O(\mathrm{ddim}(S))} n\}$, as in [Krauthgamer and Lee, 2004]. After solving the LP, the rounding
can be done in this time bound as well.

To solve the LP, we first must modify the constraints to be of the form $Px \le p$ and $Cx \ge c$. This
can be done easily by introducing complementary variables $\bar{z}^i_j \in [0, 1]$, and setting $z^i_j + \bar{z}^i_j = 1$.
For example, constraint $z^{i+1}_j \ge z^i_j$ now becomes $z^{i+1}_j + \bar{z}^i_j \ge 1$. A similar approach works for the other
constraints as well.
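The complement trick can be checked mechanically: with $\bar{z} = 1 - z$, a comparison between two variables becomes a covering constraint with non-negative coefficients. The sketch below verifies the equivalence on a grid of exact fractional values; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def nested_constraint(z_low, z_high):
    """Original form: the lower-level variable is at most the higher-level one."""
    return z_low <= z_high


def covering_form(z_low, z_high):
    """Rewritten form using the complement of z_low."""
    z_low_bar = 1 - z_low
    return z_low_bar + z_high >= 1


grid = [Fraction(k, 8) for k in range(9)]
print(all(nested_constraint(a, b) == covering_form(a, b) for a, b in product(grid, grid)))
```

The equality $z + \bar{z} = 1$ is itself expressible as one packing and one covering constraint, so the rewritten program stays within the required form.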

We now count the number of basic constraints. Note that $j \in [1, n]$ and $i \in [1, t]$, so a simple
count gives $m = O(t^2 n)$ constraints (where the quadratic term comes from constraint (15)). To
bound $r$, the maximum number of constraints in which a single variable may appear, we note that
this can always be bounded by $O(1)$ if we just make copies of variable $z^i_j$. (That is, two copies of
the form $z' = z$, $z'' = z$, then two copies of each copy, etc.) So $r = O(1)$ and the bound on $m$
increases to $O(t^2 n + n \log n)$.
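The copying procedure amounts to building a binary tree of equality constraints above each variable. The sketch below, with hypothetical names, builds such a tree until there are enough leaf copies to give every original constraint its own copy, and reports the depth of the resulting dependency chain.

```python
from collections import Counter


def build_copy_tree(name, uses):
    """Return (leaves, equalities, depth) for a variable used in `uses` constraints.

    Each round replaces every current copy by two fresh copies tied to it by
    equality constraints, until at least `uses` leaf copies exist.
    """
    frontier, equalities, depth = [name], [], 0
    while len(frontier) < uses:
        next_frontier = []
        for var in frontier:
            left, right = var + "'", var + '"'
            equalities += [(left, var), (right, var)]
            next_frontier += [left, right]
        frontier, depth = next_frontier, depth + 1
    return frontier, equalities, depth


leaves, equalities, depth = build_copy_tree("z", 5)
occurrences = Counter(var for pair in equalities for var in pair)
print(len(leaves), len(equalities), depth, max(occurrences.values()))
```

Every copy appears in at most three equality constraints (one to its parent, two to its children), and each leaf can serve one original constraint, so $r$ stays constant while the chain depth grows logarithmically in the number of uses.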

Finally, we must choose a value for $\beta$. The variable copying procedure above creates a dependency
chain of $O(\log n)$ variables, which will yield additive errors unless $\beta = O(1 / \log n)$. Similarly,
constraint (9) creates a chain of $O(t)$ variables, so $\beta = O(1/t)$. It suffices to take $\beta = O(1/(t \log n))$,
and the stated runtime follows. □